Abstract

We  propose  a  novel  algorithm  called  Evolutionary  Policy  Iteration  (EPI)  for  solving  infinite 
horizon  discounted  reward  Markov  Decision  Process  (MDP)  problems.  EPI  inherits  the  spirit  of 
the  well-known  PI  algorithm  but  eliminates  the  need  to  maximize  over  the  entire  action  space  in 
the  policy  improvement  step,  so  it  should  be  most  effective  for  problems  with  very  large  action 
spaces.  EPI  iteratively  generates  a  “population”  or  a  set  of  policies  such  that  the  performance 
of  the  “elite  policy”  for  a  population  is  monotonically  improved  with  respect  to  a  defined  fitness 
function.  EPI  converges  with  probability  one  to  a  population  whose  elite  policy  is  an  optimal 
policy for a given MDP. EPI is naturally parallelizable, and in this context a distributed
variant of PI is also studied.

Keywords: (Distributed) policy iteration, Markov decision process, genetic algorithm,
evolutionary algorithm, parallelization
1  Introduction

We  propose  a  novel  algorithm  called  Evolutionary  Policy  Iteration  (EPI)  to  solve  Markov  Decision 
Processes  (MDPs)  for  an  infinite  horizon  discounted  reward  criterion.  The  algorithm  is  especially 
targeted  to  problems  where  the  state  space  is  relatively  small  but  the  action  space  is  extremely  large, 
so  that  the  policy  improvement  step  in  Policy  Iteration  (PI)  becomes  computationally  impractical. 
EPI  eliminates  the  operation  of  maximization  over  the  entire  action  space  in  the  policy  improvement 
step  by  directly  manipulating  policies  via  a  method  called  “policy  switching”  [1]  that  generates 
an  improved  policy  from  a  set  of  given  policies.  The  computation  time  for  generating  such  an 
improved  policy  is  on  the  order  of  the  state  space  size.  The  basic  algorithmic  procedure  imitates 
that of standard genetic algorithms (GAs) (see, e.g., [5], [10], [9]) with appropriate modifications
and  extensions  required  for  the  MDP  setting,  based  on  an  idea  similar  to  the  “elitism”  concepts 
introduced  by  De  Jong  [3].  In  our  setting,  the  elite  policy  for  a  population  is  a  policy  obtained  via 
policy  switching  that  improves  the  performances  of  all  policies  in  the  population.  EPI  starts  with 
a  set  of  policies  or  “population”  and  converges  with  probability  one  (w.p.  1)  to  a  population  of 
which  the  elite  policy  is  an  optimal  policy,  while  maintaining  a  certain  monotonicity  property  for 
elite  policies  over  generations  with  respect  to  a  fitness  value. 

The  literature  applying  evolutionary  algorithms  such  as  GAs  for  solving  MDPs  is  relatively 
sparse.  The  recent  work  of  Lin,  Bean,  and  White  [6]  uses  a  GA  approach  to  construct  the  minimal 
set  of  affine  functions  that  describes  the  value  function  in  partially  observable  MDPs,  yielding  a 
variant of value iteration. Chin and Jafari [2] propose an approach that heuristically maps the “simple”
GA [10] into the framework of PI. However, their evolutionary operations do not include policy
switching,  and  convergence  to  an  optimal  policy  is  not  always  guaranteed. 

As  noted  earlier,  the  main  motivation  for  the  proposed  EPI  algorithm  is  the  setting  where 
the action space is finite but extremely large. In this case, it could be computationally
impractical  to  apply  exact  PI  or  value  iteration,  due  to  the  requirements  of  maximization  over  the 
entire  action  space  via  e.g.,  enumeration  or  random  search  methods.  On  the  other  hand,  local 
search  cannot  guarantee  that  a  global  maximum  has  been  found.  Thus,  the  monotonicity  in  the 
policy  improvement  step  is  not  preserved.  The  proposed  EPI  algorithm  preserves  an  analogous 
monotonicity  property  over  the  elite  policies  in  the  populations. 

A  primary  contribution  of  our  work  is  the  use  of  a  (random)  evolutionary  search  algorithm  in  the 
context  of  MDPs  with  a  convergence  guarantee  (w.p.  1).  Another  contribution  is  the  development 
of a parallelizable algorithm for solving MDP problems exactly via policy switching. We partition
the policy space into nonoverlapping subsets and then apply EPI or PI to
each subset in parallel. Distributed EPI applies policy switching to (convergent) elite policies for
the  subsets,  obtaining  an  optimal  policy  for  the  original  policy  space  (see  Section  4). 

This  note  is  organized  as  follows.  We  start  with  the  problem  setting  and  necessary  background 
on MDPs in Section 2. In Section 3, we formally describe the EPI algorithm with detailed discussion
and  present  the  convergence  proof.  In  Section  4,  we  study  a  distributed  variant  of  PI  and  discuss 
how  to  speed  up  EPI  by  parallelization.  We  then  conclude  with  some  remarks  in  Section  5. 


2  Background 


Consider an MDP with finite state space $X$, finite action space $A$, reward function $R : X \times A \to \mathbb{R}$,
and transition function $P$ that maps a state and action pair to a probability distribution over $X$.
We denote the probability of transitioning to state $y \in X$ when taking action $a$ in state $x \in X$ by
$P(x,a)(y)$. For simplicity, we assume that every action is admissible in every state.

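Let $\Pi$ be the set of all stationary policies $\pi : X \to A$. Define the optimal value associated with
an initial state $x \in X$:

$$V^*(x) = \max_{\pi \in \Pi} V^{\pi}(x), \quad x \in X, \text{ where}$$

$$V^{\pi}(x) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(x_t, \pi(x_t)) \,\Big|\, x_0 = x \right], \quad x \in X, \; 0 < \gamma < 1, \; \pi \in \Pi,$$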


where $x_t$ is a random variable denoting the state at time $t$ and $\gamma$ is the discount factor. Throughout
the paper, we assume that $\gamma$ is fixed. The problem we address is that of finding an optimal
policy $\pi^*$ that maximizes the expected value for an initial state distributed according to the
probability distribution $\delta$, i.e.,

$$\pi^* \in \arg\max_{\pi \in \Pi} E\left[ V^{\pi}(x_0) \right], \quad x_0 \sim \delta. \tag{1}$$

Policy iteration (PI) can be used to solve (1). For a given initial state, PI computes an optimal
policy in a finite number of steps, because there are a finite number of policies in $\Pi$, and PI
preserves the monotonicity in terms of the policy performance. The PI algorithm consists of two
parts: policy evaluation and policy improvement. Let $B(X)$ be the space of real-valued bounded
measurable functions on $X$. We define an operator $T : B(X) \to B(X)$ as

$$T(\Phi)(x) = \max_{a \in A} \left\{ R(x,a) + \gamma \sum_{y \in X} P(x,a)(y)\,\Phi(y) \right\}, \quad \Phi \in B(X), \; x \in X, \tag{2}$$


and similarly, an operator $T_{\pi} : B(X) \to B(X)$ for $\pi \in \Pi$ as

$$T_{\pi}(\Phi)(x) = R(x,\pi(x)) + \gamma \sum_{y \in X} P(x,\pi(x))(y)\,\Phi(y), \quad \Phi \in B(X), \; x \in X. \tag{3}$$

It is well known (see, e.g., [8]) that for each policy $\pi \in \Pi$, there exists a corresponding unique
$\Phi \in B(X)$ such that for $x \in X$,

$$T_{\pi}(\Phi)(x) = \Phi(x) \quad \text{and} \quad \Phi(x) = V^{\pi}(x).$$
The policy evaluation step obtains $V^{\pi}$ for a given $\pi$ via (3), and the policy improvement step
obtains $\hat{\pi} \in \Pi$, e.g., as the maximizing argument of the right-hand side of (2), such that

$$T(V^{\pi})(x) = T_{\hat{\pi}}(V^{\pi})(x), \quad x \in X.$$

The policy $\hat{\pi}$ improves $\pi$ in that $V^{\hat{\pi}}(x) \geq V^{\pi}(x)$ for all $x \in X$. However, carrying out the policy
improvement step may be impractical for large $A$, motivating our EPI algorithm as an alternative.
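The two PI steps can be illustrated on a toy problem. The following is a minimal Python sketch, not the authors' implementation: the two-state, two-action MDP (`R`, `P`, `GAMMA`) and all function names are hypothetical, and policy evaluation is approximated by iterating the operator in (3) to its fixed point rather than by solving the linear system exactly.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the paper):
# 2 states, 2 actions; action 0 stays in the current state, action 1 moves to the other.
R = [[2.0, 0.0], [0.0, 3.0]]                 # R[x][a]: reward
P = [[[1.0, 0.0], [0.0, 1.0]],               # P[x][a][y]: transition probability
     [[0.0, 1.0], [1.0, 0.0]]]
GAMMA = 0.5                                  # discount factor


def q_value(x, a, V):
    """Bracketed term of (2): one-step reward plus discounted expected value."""
    return R[x][a] + GAMMA * sum(P[x][a][y] * V[y] for y in range(len(R)))


def evaluate(policy, tol=1e-12):
    """Policy evaluation: iterate the operator of (3) until successive iterates agree."""
    V = [0.0] * len(R)
    while True:
        V_new = [q_value(x, policy[x], V) for x in range(len(R))]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def improve(V):
    """Policy improvement: maximize over the whole action space in every state."""
    n_actions = len(R[0])
    return [max(range(n_actions), key=lambda a: q_value(x, a, V))
            for x in range(len(R))]


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy


best_policy, best_value = policy_iteration([1, 0])
print(best_policy, [round(v, 6) for v in best_value])
```

The `improve` function loops over every action in every state; this is the maximization whose cost grows with $|A|$ and that EPI is designed to avoid.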

3  Evolutionary  Policy  Iteration 

3.1  Algorithm  description 

As with all evolutionary/GA algorithms, we define the $k$th generation population ($k = 0, 1, 2, \ldots$),
denoted by $P(k)$, which is a set of policies in $\Pi$, and $n = |P(k)| > 2$ is the population size, which
we take to be constant in each generation. Given the fixed initial state probability distribution $\delta$
defined over $X$, we define the average value of $\pi$ for $\delta$, or fitness value of $\pi$:

$$J_{\delta}^{\pi} = \sum_{x \in X} V^{\pi}(x)\,\delta(x).$$

Note that $J_{\delta}^{\pi}$ is simply the expectation given by the function on the right-hand side of (1), and an
optimal policy $\pi^*$ satisfies

$$J_{\delta}^{\pi^*} \geq J_{\delta}^{\pi} \quad \forall \pi \in \Pi.$$

A  high-level  description  of  the  EPI  algorithm  is  shown  in  Figure  1,  where  some  steps  (e.g.,  mutation) 
are  described  at  a  conceptual  level,  with  details  provided  in  the  following  subsections.  We  denote 
$P_m$ as the mutation selection probability, $P_g$ the global mutation probability, and $P_l$ the local
mutation probability. We also define an action selection distribution $\mu$ as a probability distribution
over $A$ such that $\sum_{a \in A} \mu(a) = 1$ and $\mu(a) > 0$ for all $a \in A$.

3.2  Initialization  and  Policy  Selection 

Convergence  of  the  EPI  algorithm  is  independent  of  the  initial  population  P(0)  (to  be  shown  later), 
mainly  due  to  the  Policy  Mutation  step.  We  can  randomly  generate  an  initial  population  or  start 
with  a  set  of  heuristic  policies.  One  simple  initialization  is  a  population  of  policies  with  the  property 
that  the  same  action  is  prescribed  for  every  state,  but  each  policy  in  the  population  prescribes  a 
different  action. 
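As a rough illustration of this simple initialization and of the fitness value defined in Section 3.1, the sketch below builds one constant-action policy per action and scores each one. The toy MDP, the uniform initial distribution `DELTA`, and the function names are assumptions made for the example only.

```python
# Hypothetical toy MDP (not from the paper): action 0 stays, action 1 moves to the other state.
R = [[2.0, 0.0], [0.0, 3.0]]                 # R[x][a]
P = [[[1.0, 0.0], [0.0, 1.0]],               # P[x][a][y]
     [[0.0, 1.0], [1.0, 0.0]]]
GAMMA = 0.5
DELTA = [0.5, 0.5]                           # assumed initial state distribution


def evaluate(policy, tol=1e-12):
    """Approximate the value function of a policy by fixed-point iteration."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[x][policy[x]]
                 + GAMMA * sum(P[x][policy[x]][y] * V[y] for y in range(n))
                 for x in range(n)]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def constant_population(actions, n_states):
    """One policy per listed action; each prescribes that action in every state."""
    return [[a] * n_states for a in actions]


def fitness(policy, delta):
    """Fitness value: the policy's value function averaged over delta."""
    return sum(v * d for v, d in zip(evaluate(policy), delta))


# With only two actions this yields two policies; EPI itself assumes n > 2,
# so a real run would add further (e.g., random or heuristic) policies.
population = constant_population(range(len(R[0])), len(R))
for policy in population:
    print(policy, round(fitness(policy, DELTA), 6))
```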

3.3  Policy  Switching 

One of the basic procedural steps in GA is to select members from the current population to create
a “mating pool” to which “crossover” is applied; this step is called “parent selection”. Similarly,
we can design a “policy selection” step to create a mating pool; there are many ways of doing this.
The Policy Switching step includes this selection step implicitly.

Evolutionary Policy Iteration (EPI)

• Initialization:

  Select population size $n$ and $K > 0$. $P(0) = \{\pi_1, \ldots, \pi_n\}$, where $\pi_i \in \Pi$.
  Set $N = k = 0$, and $P_m$, $P_g$ and $P_l$ in $(0, 1]$, and $\pi^*(-1) = \pi_1$.

• Repeat:

  - Policy Switching:

    * Obtain $V^{\pi}$ for each $\pi \in P(k)$.

    * Generate the elite policy of $P(k)$ defined as

      $\pi^*(k)(x) \in \{\arg\max_{\pi \in P(k)} (V^{\pi}(x))(x)\}, \; x \in X.$

    * Stopping Rule:

      • If $J_{\delta}^{\pi^*(k)} \neq J_{\delta}^{\pi^*(k-1)}$, $N = 0$.

      • If $J_{\delta}^{\pi^*(k)} = J_{\delta}^{\pi^*(k-1)}$ and $N = K$, terminate EPI.

      • If $J_{\delta}^{\pi^*(k)} = J_{\delta}^{\pi^*(k-1)}$ and $N < K$, $N \leftarrow N + 1$.

    * Generate $n - 1$ random subsets $S_i$, $i = 1, \ldots, n-1$ of $P(k)$
      by selecting $m \in \{2, \ldots, n-1\}$ with equal probability and selecting $m$
      policies in $P(k)$ with equal probability.

    * Generate $n - 1$ policies $\pi(S_i)$ defined as:

      $\pi(S_i)(x) \in \{\arg\max_{\pi \in S_i} (V^{\pi}(x))(x)\}, \; x \in X.$

  - Policy Mutation: For each policy $\pi(S_i)$, $i = 1, \ldots, n-1$,

    * Generate a “globally” mutated policy $\pi^m(S_i)$ w.p. $P_m$ using $P_g$ and $\mu$ or
      a “locally” mutated policy $\pi^m(S_i)$ w.p. $1 - P_m$ using $P_l$ and $\mu$.

  - Population Generation:

    * $P(k+1) = \{\pi^*(k), \pi^m(S_i)\}$, $i = 1, \ldots, n-1$.

    * $k \leftarrow k + 1$.

Figure 1: Evolutionary Policy Iteration (EPI)

Given a nonempty subset $\Delta$ of $\Pi$, we define a policy $\bar{\pi}$ generated by policy switching with
respect to $\Delta$ as

$$\bar{\pi}(x) \in \left\{ \arg\max_{\pi \in \Delta} (V^{\pi}(x))(x) \right\}, \quad x \in X. \tag{4}$$

That is, in each state $x$, $\bar{\pi}$ takes the action prescribed at $x$ by a policy in $\Delta$ whose value at $x$
is largest. For completeness, we show that the policy generated by policy switching improves any
policy in $\Delta$ (see also Theorem 3 in [1]).

Theorem 3.1 Consider a nonempty subset $\Delta$ of $\Pi$ and the policy $\bar{\pi}$ generated by policy switching
with respect to $\Delta$ given in Equation (4). Then, for all $x \in X$,

$$V^{\bar{\pi}}(x) \geq \max_{\pi \in \Delta} V^{\pi}(x).$$

Proof: We begin with a lemma, which states a basic property of the $T_{\pi}$-operator defined by (3).

Lemma 3.1 Given $\pi \in \Pi$, suppose there exists $\Phi \in B(X)$ for which

$$T_{\pi}(\Phi)(x) \geq \Phi(x), \quad x \in X. \tag{5}$$

Then, $V^{\pi}(x) \geq \Phi(x)$ for all $x \in X$.

Proof: By successive applications of the $T_{\pi}$-operator to both sides of Equation (5) and the
monotonicity property of the operator, we have that for all $x \in X$,

$$\lim_{n \to \infty} T_{\pi}^{n}(\Phi)(x) \geq \Phi(x).$$

And by the Banach fixed point theorem, $\lim_{n \to \infty} T_{\pi}^{n}(\Phi)(x) = V^{\pi}(x)$, $x \in X$, which proves the
lemma. ■

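Now define $\Phi(x) = \max_{\pi \in \Delta} V^{\pi}(x)$ for all $x \in X$, where $\Delta$ is the subset of policies in the
theorem and $\bar{\pi}$ is the policy generated by policy switching. Pick an arbitrary state $x \in X$. From the
definition, there exists a policy $\pi' \in \Delta$ such that $V^{\pi'}(x) \geq V^{\pi}(x)$ for all $\pi \in \Delta$ and $\bar{\pi}(x) = \pi'(x)$.
It follows that

$$
\begin{aligned}
T_{\bar{\pi}}(\Phi)(x) &= R(x,\bar{\pi}(x)) + \gamma \sum_{y \in X} P(x,\bar{\pi}(x))(y)\,\Phi(y) \\
&= R(x,\pi'(x)) + \gamma \sum_{y \in X} P(x,\pi'(x))(y)\,\Phi(y) \\
&\geq R(x,\pi'(x)) + \gamma \sum_{y \in X} P(x,\pi'(x))(y)\,V^{\pi'}(y) \\
&= V^{\pi'}(x) = \Phi(x).
\end{aligned}
$$

By the lemma above, the claim is proved. ■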
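The statement of Theorem 3.1 can be checked numerically on a small case. The sketch below is only an illustration under assumed data: the toy MDP and the names `evaluate` and `policy_switching` are hypothetical, and value functions are approximated by fixed-point iteration of the operator in (3).

```python
# Hypothetical toy MDP (not from the paper): action 0 stays, action 1 moves to the other state.
R = [[2.0, 0.0], [0.0, 3.0]]                 # R[x][a]
P = [[[1.0, 0.0], [0.0, 1.0]],               # P[x][a][y]
     [[0.0, 1.0], [1.0, 0.0]]]
GAMMA = 0.5


def evaluate(policy, tol=1e-12):
    """Approximate the value function of a policy by fixed-point iteration."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[x][policy[x]]
                 + GAMMA * sum(P[x][policy[x]][y] * V[y] for y in range(n))
                 for x in range(n)]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def policy_switching(policies):
    """Equation (4): in each state, copy the action of a policy with the largest value there."""
    values = [evaluate(pi) for pi in policies]
    switched = []
    for x in range(len(R)):
        best = max(range(len(policies)), key=lambda i: values[i][x])
        switched.append(policies[best][x])
    return switched


pi_1, pi_2 = [0, 0], [1, 1]                  # neither policy dominates the other
pi_bar = policy_switching([pi_1, pi_2])
print(pi_bar, evaluate(pi_1), evaluate(pi_2), evaluate(pi_bar))
```

In this example the two input policies have values of roughly (4, 0) and (2, 4), while the switched policy reaches roughly (4, 5), so it is at least as good as both in every state. Note that the work done by `policy_switching` after evaluation is one comparison per policy per state; no loop over the action space is involved.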

The  above  theorem  immediately  implies  the  following  result. 

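Corollary 3.1 Consider a nonempty subset $\Delta$ of $\Pi$ and the policy $\bar{\pi}$ generated by policy switching
with respect to $\Delta$ given in Equation (4). Then, for any initial state distribution $\delta$,

$$J_{\delta}^{\bar{\pi}} \geq \max_{\pi \in \Delta} J_{\delta}^{\pi}.$$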
We first generate a policy $\pi^*(k)$, called the elite policy with respect to the current population
$P(k)$, which improves any policy in $P(k)$ via policy switching. Note that this is different from
the elitist concept of De Jong [3], where the elitist is the best policy in $P(k)$. EPI includes the
elite policy generated by policy switching unmutated in the new population. By doing so, the
population contains a policy that improves any policy in the previous population. Therefore, the
following monotonicity property holds:

Lemma 3.2 For any $\delta$ and for all $k \geq 0$,

$$J_{\delta}^{\pi^*(k)} \geq J_{\delta}^{\pi^*(k-1)}.$$

Proof: The proof is by induction. The base step is obvious from the definition of $\pi^*(0)$ and $\pi^*(-1)$
by Corollary 3.1. Assume that $J_{\delta}^{\pi^*(i)} \geq J_{\delta}^{\pi^*(i-1)}$ is true for all $i \leq k$. Because the EPI algorithm includes
$\pi^*(k)$ in $P(k+1)$, the elite policy at $k+1$ is generated over a population that contains $\pi^*(k)$, which
implies that $J_{\delta}^{\pi^*(k+1)} \geq J_{\delta}^{\pi^*(k)}$. ■

We then generate $n - 1$ random subsets $S_i$, $i = 1, \ldots, n-1$ of $P(k)$ as follows. We first select
$m \in \{2, \ldots, n-1\}$ with equal probability and then select $m$ policies from $P(k)$ with equal probability.
By applying policy switching, we generate $n - 1$ policies defined as

$$\pi(S_i)(x) \in \left\{ \arg\max_{\pi \in S_i} (V^{\pi}(x))(x) \right\}, \quad x \in X.$$

These policies will be mutated to generate a new population (see the next subsection).
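The subset generation and the resulting offspring can be sketched as follows. This is a hedged illustration, not the authors' code: the toy MDP and function names are hypothetical, and "selecting $m$ policies with equal probability" is interpreted here as sampling without replacement, which is an assumption.

```python
import random
from itertools import product

# Hypothetical toy MDP (not from the paper): action 0 stays, action 1 moves to the other state.
R = [[2.0, 0.0], [0.0, 3.0]]                 # R[x][a]
P = [[[1.0, 0.0], [0.0, 1.0]],               # P[x][a][y]
     [[0.0, 1.0], [1.0, 0.0]]]
GAMMA = 0.5


def evaluate(policy, tol=1e-12):
    """Approximate the value function of a policy by fixed-point iteration."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[x][policy[x]]
                 + GAMMA * sum(P[x][policy[x]][y] * V[y] for y in range(n))
                 for x in range(n)]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def policy_switching(policies):
    """In each state, copy the action of a policy with the largest value there."""
    values = [evaluate(pi) for pi in policies]
    return [policies[max(range(len(policies)), key=lambda i: values[i][x])][x]
            for x in range(len(R))]


def random_subsets(population, rng):
    """Draw n - 1 subsets; each has a size m uniform on {2, ..., n-1}."""
    n = len(population)
    subsets = []
    for _ in range(n - 1):
        m = rng.randint(2, n - 1)                    # both endpoints included
        subsets.append(rng.sample(population, m))    # m members, no replacement
    return subsets


# Population of all four deterministic policies of the toy MDP (n = 4).
population = [list(pi) for pi in product(range(2), repeat=2)]
rng = random.Random(0)
subsets = random_subsets(population, rng)
offspring = [policy_switching(s) for s in subsets]
for s, child in zip(subsets, offspring):
    print(s, "->", child)
```

Whatever subsets are drawn, each child should be at least as good in every state as every member of its subset, by Theorem 3.1.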

The policy switching step is a key part of EPI for speeding up its convergence. Suppose
that $S_i$ for some $i$ consists of two policies $\pi_1$ and $\pi_2$, and let $Y = \{x \mid \pi_1(x) \neq \pi_2(x), x \in X\}$. Write
$\pi > \pi'$ if for all $x \in X$,

$$V^{\pi}(x) \geq V^{\pi'}(x)$$

and for some state $x \in X$,

$$V^{\pi}(x) > V^{\pi'}(x),$$

and write $\pi \geq \pi'$ if for all $x \in X$, $V^{\pi}(x) \geq V^{\pi'}(x)$. Then there are at least $|Y|$ policies $\tilde{\pi}_j$,
$j = 1, \ldots, |Y|$, such that for each $j$, either

$$\pi(S_i) \geq \tilde{\pi}_j > \pi_1 \quad \text{or} \quad \pi(S_i) \geq \tilde{\pi}_j > \pi_2$$

holds. In other words, by one application of policy switching, we eliminate at least $|Y|$ policies but
at most $|X|$ in the search process. This is because given a policy $\pi$, if we can improve the policy $\pi$
by modifying the actions in $m$ states, we rule out at least $m$ policies that are better than $\pi$. See
Lemma 5 in [7] for a formal proof.

As  we  can  see,  policy  switching  directly  manipulates  policies  to  generate  an  improved  policy 
relative  to  all  policies  it  was  applied  to,  eliminating  the  operation  of  maximization  over  the  entire 
action  space,  which  is  the  main  computational  advantage  that  replaces  the  policy  improvement 
step  in  the  original  PI. 
3.4  Policy  Mutation

Policy mutation is carried out by altering a given policy in the following manner: for each state, the
currently prescribed action is replaced probabilistically. The main reason for mutating policies is
to avoid being caught in a local maximum, making a probabilistic convergence guarantee possible
(see the convergence proof below). We consider two types of mutation: “local” and “global”, which
are distinguished by the degree of mutation, as indicated by the number of states with changed
actions in the mutated policy. To this end, we assume that $P_l \ll P_g$, with $P_l$ being close to zero
and $P_g$ being close to one. The Policy Mutation step first determines whether a given policy
$\pi$ is mutated globally or locally, using a Bernoulli trial with probability $P_m$. If $\pi$ is globally (resp., locally)
mutated, then for each state $x$, $\pi(x)$ is changed w.p. $P_g$ (resp., $P_l$), where the action to which it is
changed follows the given action selection distribution $\mu$. Local mutation helps the algorithm
fine-tune good policies via local search, whereas global mutation helps the algorithm escape from
local maxima. One simple way to select the particular action to which the current action is
mutated is to select randomly (uniformly) among all other actions.
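A minimal sketch of this mutation step is given below. The function name and parameter names are hypothetical. As an assumption consistent with the convergence argument of Section 3.6, the replacement action is drawn from $\mu$ over all actions, so it may coincide with the current action; the uniform choice among the other actions mentioned above would be a small variation.

```python
import random


def mutate(policy, mu, p_m, p_g, p_l, rng):
    """Return a mutated copy of `policy` (a list of action indices, one per state).

    With probability p_m the mutation is global (per-state change rate p_g),
    otherwise local (per-state change rate p_l). A changed state receives an
    action drawn from the action selection distribution mu.
    """
    rate = p_g if rng.random() < p_m else p_l
    actions = range(len(mu))
    mutated = []
    for current in policy:
        if rng.random() < rate:
            mutated.append(rng.choices(actions, weights=mu)[0])
        else:
            mutated.append(current)
    return mutated


rng = random.Random(1)
uniform_mu = [0.25, 0.25, 0.25, 0.25]        # four actions, equally likely
original = [0, 0, 0, 0, 0, 0]
print(mutate(original, uniform_mu, p_m=0.5, p_g=0.9, p_l=0.1, rng=rng))
```

The input policy is left unmodified, which matters in EPI because the same population members may be reused across several subsets.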

3.5  Population  Generation  and  Stopping  Rule 

At each $k$th generation, the new population $P(k+1)$ is simply given by the elite policy generated
from $P(k)$ and $n - 1$ mutated policies from $\pi(S_i)$, $i = 1, \ldots, n-1$. This population generation
method allows a policy that is poor in terms of performance, but might be in the neighborhood of
an optimal value located at the top of a very narrow hill, to be kept in the population so that a
new search region can be started from the policy. This helps the algorithm to avoid being caught
in the region of local optima.

Once  we  have  a  new  population,  we  need  to  test  whether  EPI  should  terminate.  Even  if  the 
fitness  values  for  the  two  consecutive  elite  policies  are  identical,  this  does  not  necessarily  mean  that 
the  elite  policy  is  an  optimal  policy  as  in  PI.  Therefore,  we  run  the  EPI  algorithm  K  more  times  so 
that  these  random  jumps  by  the  mutation  step  will  eventually  bring  EPI  to  a  neighborhood  of  the 
optimum. As the value of $K$ gets larger, the probability of being in a neighborhood of the optimum
increases, so our confidence that the elite policy at termination is optimal grows with $K$.
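Putting the steps of Figure 1 together, the following is a compact sketch of the whole loop under stated assumptions: the toy MDP, the parameter values, the floating-point tolerance used to compare fitness values, and all function names are hypothetical choices for illustration, and policy evaluation is approximate.

```python
import random

# Hypothetical toy MDP (not from the paper): action 0 stays, action 1 moves to the other state.
R = [[2.0, 0.0], [0.0, 3.0]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
GAMMA, DELTA = 0.5, [0.5, 0.5]
STATES, ACTIONS = range(len(R)), range(len(R[0]))


def evaluate(policy, tol=1e-12):
    """Approximate the value function by fixed-point iteration of the operator in (3)."""
    V = [0.0 for _ in STATES]
    while True:
        V_new = [R[x][policy[x]] + GAMMA * sum(P[x][policy[x]][y] * V[y] for y in STATES)
                 for x in STATES]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def fitness(policy):
    """Fitness value: the value function averaged over the initial distribution."""
    return sum(v * d for v, d in zip(evaluate(policy), DELTA))


def switch(policies):
    """Policy switching (4): per state, copy the action of the best-valued policy."""
    values = [evaluate(pi) for pi in policies]
    return [policies[max(range(len(policies)), key=lambda i: values[i][x])][x]
            for x in STATES]


def mutate(policy, p_m, p_g, p_l, mu, rng):
    """Global mutation w.p. p_m (per-state rate p_g), otherwise local (rate p_l)."""
    rate = p_g if rng.random() < p_m else p_l
    return [rng.choices(ACTIONS, weights=mu)[0] if rng.random() < rate else a
            for a in policy]


def epi(population, K, p_m, p_g, p_l, mu, rng):
    """Run the loop of Figure 1; return the final elite policy and its fitness history."""
    n = len(population)
    prev_fit = fitness(population[0])          # elite "before generation 0" is pi_1
    N, history = 0, []
    while True:
        elite = switch(population)
        fit = fitness(elite)
        history.append(fit)
        if abs(fit - prev_fit) > 1e-9:         # fitness changed: reset the counter
            N = 0
        elif N == K:                           # unchanged K more times: stop
            return elite, history
        else:
            N += 1
        prev_fit = fit
        offspring = []
        for _ in range(n - 1):
            m = rng.randint(2, n - 1)          # subset size, uniform on {2, ..., n-1}
            subset = rng.sample(population, m)
            offspring.append(mutate(switch(subset), p_m, p_g, p_l, mu, rng))
        population = [elite] + offspring       # elite is kept unmutated


elite, history = epi([[0, 0], [1, 1], [1, 0]], K=5, p_m=0.5, p_g=0.9, p_l=0.1,
                     mu=[0.5, 0.5], rng=random.Random(0))
print(elite, [round(f, 6) for f in history])
```

On this tiny example policy switching over the initial population already produces an optimal elite policy, so the run mostly exercises the stopping rule; the fitness history is nondecreasing, as Lemma 3.2 requires. With a large action space, the mutation step is what introduces actions not yet present in the population.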

3.6  Convergence 

Theorem 3.2 Given $P_m > 0$, $P_g > 0$, $P_l > 0$, and an action selection distribution $\mu$ such that
$\sum_{a \in A} \mu(a) = 1$ and $\mu(a) > 0$ for all $a \in A$, $\pi^*(k) \to \pi^*$ w.p. 1 as $K \to \infty$ for any $P(0)$.

Proof: The proof is straightforward. Observe first that as $K \to \infty$, $k \to \infty$. This is because EPI
terminates when $N = K$ and if $N \neq K$, the value of $k$ increases by one.
From the assumption, the probability of generating an optimal policy by the Policy Mutation
step is positive. To see this, let $\alpha$ be the probability of generating one of the optimal policies by local
mutation and let $\beta$ be the probability of generating one of the optimal policies by global mutation.
Then,

$$\alpha \geq \prod_{x \in X} P_l\,\mu(\pi^*(x)) = (P_l)^{|X|} \cdot \prod_{x \in X} \mu(\pi^*(x)) > 0,$$

$$\beta \geq \prod_{x \in X} P_g\,\mu(\pi^*(x)) = (P_g)^{|X|} \cdot \prod_{x \in X} \mu(\pi^*(x)) > 0, \tag{6}$$

where $\pi^*$ is a particular optimal policy in $\Pi$. Therefore, the probability of generating an optimal
policy by the Policy Mutation step is positive and this probability is independent of $P(0)$.

Therefore, the probability that $P(k)$ does not contain an optimal policy (starting from an
arbitrary $P(0)$) is at most $((1-\alpha)P_m)^{(n-1)k}((1-\beta)(1-P_m))^{(n-1)k}$, which goes to zero as $k \to \infty$.
By Lemma 3.2, once $P(k)$ contains an optimal policy, $P(k+m)$ contains an optimal policy for any
$m \geq 1$ because the fitness value of an optimal policy is the maximum among all policies in $\Pi$. This
proves the claim. ■
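To give a feel for the size of the lower bounds in (6), the sketch below evaluates the product for a hypothetical case with two states, a uniform action selection distribution over two actions, and assumed values $P_l = 0.1$ and $P_g = 0.9$. The function name and the numbers are illustrative assumptions only.

```python
def mutation_lower_bound(p_change, mu, optimal_policy):
    """Product over states of p_change * mu(pi*(x)), as on the left of (6)."""
    bound = 1.0
    for action in optimal_policy:
        bound *= p_change * mu[action]
    return bound


mu = [0.5, 0.5]                 # assumed uniform action selection distribution
optimal_policy = [0, 1]         # assumed optimal action in each of two states

alpha_lb = mutation_lower_bound(0.1, mu, optimal_policy)   # local mutation
beta_lb = mutation_lower_bound(0.9, mu, optimal_policy)    # global mutation
print(alpha_lb, beta_lb)
```

Both bounds are strictly positive, which is all the proof needs, but they shrink geometrically with the number of states, so the guarantee says nothing about how quickly an optimal policy is found.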

4  Parallelization 

The EPI algorithm can be naturally parallelized, and by doing so, we can reduce the running time.
Basically, we partition the policy space $\Pi$ into subsets $\{\Pi_i\}$ such that $\bigcup_i \Pi_i = \Pi$ and $\Pi_i \cap \Pi_j = \emptyset$
for all $i \neq j$. We then apply EPI to each $\Pi_i$ in parallel, and once each part terminates, the
best policy $\pi_i^*$ from each part is taken. We then apply policy switching to the set of best policies
$\{\pi_i^*\}$. We state a general result regarding parallelization of any algorithm that finds optimal policies
for MDPs.

Theorem 4.1 Given a partition of $\Pi$ such that $\bigcup_i \Pi_i = \Pi$ and $\Pi_i \cap \Pi_j = \emptyset$ for all $i \neq j$, consider
an algorithm $\mathcal{A}$ that generates the best policy $\pi_i^*$ for $\Pi_i$ such that for all $x \in X$,

$$V^{\pi_i^*}(x) \geq \max_{\pi \in \Pi_i} V^{\pi}(x).$$

Then, the policy $\bar{\pi}$ defined as

$$\bar{\pi}(x) \in \left\{ \arg\max_{\pi_i^*} (V^{\pi_i^*}(x))(x) \right\}, \quad x \in X,$$

is an optimal policy for $\Pi$.

Proof: Via policy switching, $\bar{\pi}$ improves the performance of each $\pi_i^*$, i.e.,

$$V^{\bar{\pi}}(x) \geq \max_i V^{\pi_i^*}(x), \quad x \in X,$$

implying that $\bar{\pi}$ is an optimal policy for $\Pi$, since the partition covers the entire policy space. ■
Note that we cannot just pick the best policy among the $\pi_i^*$ in terms of the fitness value $J_{\delta}^{\pi}$. The
condition that $J_{\delta}^{\pi} \geq J_{\delta}^{\pi'}$ for $\pi \neq \pi'$ does not always imply that $V^{\pi}(x) \geq V^{\pi'}(x)$ for all $x \in X$, even
though the converse is true. In other words, we need a policy that improves all policies $\pi_i^*$. Picking
the best policy among such policies does not necessarily guarantee an optimal policy for $\Pi$.

If the number of subsets in the partition is $N$, the overall convergence of the algorithm $\mathcal{A}$ is
faster by a factor of $N$. For example, if at state $x$ the action $a$ or $b$ can be taken, let
$\Pi_1 = \{\pi \mid \pi(x) = a, \pi \in \Pi\}$ and $\Pi_2 = \{\pi \mid \pi(x) = b, \pi \in \Pi\}$. By using this partition, the convergence rate
of the algorithm $\mathcal{A}$ will be twice as fast.
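The two-subset partition just described can be tried out on a toy problem. In the sketch below, the MDP and function names are hypothetical, and a brute-force search stands in for the per-subset algorithm of Theorem 4.1; the two searches are run one after the other here, whereas in practice they would run in parallel. The search picks the policy with the largest sum of state values, which identifies a policy that is best in every state whenever such a policy exists in the subset (assumed here).

```python
from itertools import product

# Hypothetical toy MDP (not from the paper): action 0 stays, action 1 moves to the other state.
R = [[2.0, 0.0], [0.0, 3.0]]                 # R[x][a]
P = [[[1.0, 0.0], [0.0, 1.0]],               # P[x][a][y]
     [[0.0, 1.0], [1.0, 0.0]]]
GAMMA = 0.5


def evaluate(policy, tol=1e-12):
    """Approximate the value function of a policy by fixed-point iteration."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[x][policy[x]]
                 + GAMMA * sum(P[x][policy[x]][y] * V[y] for y in range(n))
                 for x in range(n)]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def policy_switching(policies):
    """In each state, copy the action of a policy with the largest value there."""
    values = [evaluate(pi) for pi in policies]
    return [policies[max(range(len(policies)), key=lambda i: values[i][x])][x]
            for x in range(len(R))]


def best_in_part(part):
    """Brute-force stand-in for the per-subset algorithm (an assumption of this sketch)."""
    return max(part, key=lambda pi: sum(evaluate(pi)))


all_policies = [list(pi) for pi in product(range(2), repeat=2)]
part_1 = [pi for pi in all_policies if pi[0] == 0]   # policies taking action 0 in state 0
part_2 = [pi for pi in all_policies if pi[0] == 1]   # policies taking action 1 in state 0

winners = [best_in_part(part_1), best_in_part(part_2)]
combined = policy_switching(winners)
print(winners, "->", combined)
```

The combined policy should be at least as good as every policy in the full policy space in every state, which is the conclusion of Theorem 4.1 for this small case.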

By Theorem 4.1, this idea can be applied to PI via policy switching, yielding a “distributed”
PI. We apply PI to each $\Pi_i$. Once PI for each part terminates, we combine the resulting policies for
the parts by policy switching. The combined policy is an optimal policy, so this method will
speed up the original PI by a factor of $N$ if the number of subsets in the partition is $N$. However,
note that this distributed variant of PI will also involve the operation of the maximization over the
action space in the policy improvement step. The result of Theorem 4.1 also naturally extends to
a dynamic programming version of PI, similarly to EPI. For example, we can partition $\Pi$ into $\Pi_1$
and $\Pi_2$, and subdivide $\Pi_1$ into $\Pi_{11}$ and $\Pi_{12}$, and $\Pi_2$ into $\Pi_{21}$ and $\Pi_{22}$. The optimal substructure
property is preserved by policy switching. Suppose that the number of subsets generated in this
way is $\beta$. Then the overall computation time of an optimal policy is $O(\beta \cdot |X| \cdot C)$, where $C$ is the
maximum size of the subsets in terms of the number of policies, because policy switching is applied
$O(\beta)$ times with $O(|X|)$ complexity and $C$ is the upper bound on PI-complexity.

5  Concluding  Remarks 

The  discussion  in  the  previous  section  raises  an  important  question  that  can  motivate  further 
research: How can we partition the policy space so that PI or EPI converges faster? For
well-chosen partitions, we may even be able to obtain optimal policies for some subsets analytically.
Much  of  the  MDP  literature  concentrates  on  aggregation  in  the  state  space  (see,  e.g.,  [4])  for  an 
approximate  solution  for  a  given  MDP.  Our  discussion  on  the  parallelization  of  PI  and  EPI  can  be 
viewed  in  some  sense  as  an  aggregation  in  the  policy  space,  where  the  distributed  version  of  EPI 
can  be  used  to  generate  an  approximate  solution  of  a  given  MDP. 

In  our  setting,  the  mutated  action  for  a  mutated  state  was  determined  (probabilistically)  by  a 
given  action  selection  distribution.  If  the  action  space  is  continuous,  say  [0,1],  a  straightforward 
implementation might change only the least significant digit for local mutation and the most
significant digit for global mutation, where numbers in [0, 1] are represented by a certain number of
significant  digits. 

GAs  are  known  to  work  well  for  many  continuous  domain  problems  but  to  face  difficulties  of 
a different kind for problems where the decision variables are discrete [9]. However, EPI
circumvents this problem via policy switching, an idea that has not been exploited in the GA literature
previously. 